Abstract

In this paper, we propose a single low-dimensional representation of a large collection of table and hyponym data, and show that with a small number of primitive operations, this representation can be used effectively for many purposes. Specifically, we consider queries such as set expansion and class prediction. We evaluate our methods on publicly available semi-structured datasets from the Web.

1  Introduction 

Semi-structured data extracted from the web (in some cases extended with hyponym data derived from Hearst patterns like “X such as Y”) have been used in several tasks, including set expansion (Wang and Cohen, 2009b; Dalvi et al., 2010), automatic set-instance acquisition (Wang and Cohen, 2009a), fact extraction (Dalvi et al., 2012; Talukdar et al., 2008), and semi-supervised learning of concepts (Carlson et al., 2010). In past work, these tasks have been addressed using different methods and data structures. In this paper, we propose a single low-dimensional representation of a large collection of table and hyponym data, and show that with a small number of primitive operations, this representation can be used effectively for many purposes.

In particular, we propose a low-dimensional representation for entities based on the embedding used by the PIC (Power Iteration Clustering) algorithm (Lin and Cohen, 2010a). PIC assigns each node in a graph an initial random value, and then performs an iterative update which brings together the values assigned to nearby nodes, thus producing a one-dimensional embedding of a graph. In past work, PIC has been used for unsupervised clustering of graphs (Lin and Cohen, 2010a); it has also been extended to bipartite graphs (Lin and Cohen, 2010b), and it has been shown that performance can be improved by using multiple random starting points, thus producing a low-dimensional (but not one-dimensional) embedding of a graph (Balasubramanyan et al., 2010).
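
The iterative update described above can be illustrated with a minimal sketch. It assumes the row-normalized power-iteration form of PIC (each node's value is replaced by the weighted average of its neighbors' values, followed by L1 normalization) and stops after a small, fixed number of iterations. The function name `pic_embedding`, the dictionary-based graph format, and the toy graph are illustrative assumptions, not part of the original work.

```python
import random


def pic_embedding(adjacency, num_iters, seed=0):
    """One-dimensional PIC-style embedding of an undirected weighted graph.

    adjacency: dict mapping node -> dict of neighbor -> positive weight.
    Assumes every node has at least one neighbor.
    Returns a dict node -> embedding value after num_iters updates.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Random starting values, scaled to unit L1 norm.
    v = {u: rng.random() for u in nodes}
    norm = sum(abs(x) for x in v.values())
    v = {u: x / norm for u, x in v.items()}
    for _ in range(num_iters):
        new_v = {}
        for u in nodes:
            degree = sum(adjacency[u].values())
            # Row-normalized update: weighted average of the neighbors' values.
            new_v[u] = sum(w * v[nbr] for nbr, w in adjacency[u].items()) / degree
        norm = sum(abs(x) for x in new_v.values())
        v = {u: x / norm for u, x in new_v.items()}
    return v


if __name__ == "__main__":
    # Toy graph (hypothetical): two triangles joined by one weak edge.
    graph = {
        "a": {"b": 1.0, "c": 1.0},
        "b": {"a": 1.0, "c": 1.0},
        "c": {"a": 1.0, "b": 1.0, "x": 0.01},
        "x": {"y": 1.0, "z": 1.0, "c": 0.01},
        "y": {"x": 1.0, "z": 1.0},
        "z": {"x": 1.0, "y": 1.0},
    }
    for node, value in pic_embedding(graph, num_iters=10).items():
        print(node, round(value, 4))
```

Stopping early is the point of the sketch: values inside a tightly connected group converge towards each other much faster than values across weakly connected groups, which is what makes the intermediate vector useful as an embedding.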

2  The  PIC3  Representation 

[Figure 1: data graph with Entity-suchas and Entity-column edges]


We use PIC to produce an embedding of a tripartite graph, in particular the data graph of Figure 1. We use the publicly available (Dalvi et al., 2012) entity-tableColumn co-occurrence dataset and Hyponym Concept dataset. Each edge derived from the entity-tableColumn dataset links an entity name with an identifier for a table column in which the entity name appeared. Each edge derived from the Hyponym Concept dataset links an entity X and a concept Y with which it appeared in the context of a Hearst pattern (weighted by frequency in a large web corpus). We combine these edges to form a tripartite graph, as shown in Figure 1. Occurrences of entities with hyponym (or “such as”) concepts form a bipartite graph on the left, and occurrences of entities in various table-columns form the bipartite graph on the right. Our hypothesis is that entities co-occurring in multiple table columns or with similar suchas concepts probably share the same class label.

Since we have two bipartite graphs, entity-tableColumn and entity-suchasConcept, we create a bipartite PIC embedding for each of them separately, retaining only the part of the embedding relevant to the entities. Specifically, for each graph we start with m random vectors to generate an m-dimensional PIC embedding. The embedding for entities is then the concatenation of these separate embeddings (refer to Algorithm 1). Below we will call this the PIC3 embedding.

Figure 2 shows schematic diagrams of the final and intermediate matrices produced while creating the PIC3 embedding. We also experimented with a variant of this algorithm in which we create PIC embeddings of the data by concatenating the dimensions first, instead of computing separate embeddings and concatenating them afterwards. We observed that the version shown in Algorithm 1 performs as well as or better than this variant.


Algorithm 1 Create PIC3 embedding
 1: function Create_PIC3_Embedding(E, X_T, X_S, m): X_PIC3
 2:   Input: E: Set of all entities,
             X_T: Co-occurrence of E in table-columns,
             X_S: Co-occurrence of E with suchasConcepts,
             m: Number of PIC dimensions per bipartite graph
 3:   Output: X_PIC3: 2*m dim. PIC3 embedding of E.
 4:   X_PIC3 = ∅
 5:   t = a small positive integer
 6:   for d = 1 : m do
 7:     V_0 = randomly initialized vector of size |E| * 1
 8:     V_t = PIC-Embedding(X_T, V_0, t)
 9:     Add V_t as dth column in X_PIC3
10:   end for
11:   for d = 1 : m do
12:     V_0 = randomly initialized vector of size |E| * 1
13:     V_t = PIC-Embedding(X_S, V_0, t)
14:     Add V_t as (m+d)th column in X_PIC3
15:   end for
16: end function
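
A minimal Python sketch of Algorithm 1 follows. It makes several assumptions that the text does not spell out: each bipartite graph is stored as a dictionary `entity -> {feature: weight}`, the entity-entity affinity is taken to be the product of the co-occurrence matrix with its transpose (applied as two sparse passes, never materialized), and an entity with no edges in a view keeps its current value. The names `bipartite_pic` and `create_pic3_embedding` and the default `t=5` are illustrative choices.

```python
import random


def bipartite_pic(cooc, entities, num_iters, rng):
    """One PIC dimension for the entities of one bipartite graph.

    cooc: dict entity -> dict feature -> weight.
    Assumed affinity between entities: X X^T, applied as two sparse passes.
    """
    col_sum = {}
    for e in entities:
        for f, w in cooc.get(e, {}).items():
            col_sum[f] = col_sum.get(f, 0.0) + w
    # Row sums of X X^T, used to row-normalize the update.
    degree = {e: sum(w * col_sum[f] for f, w in cooc.get(e, {}).items())
              for e in entities}
    v = {e: rng.random() for e in entities}  # fresh random start (V_0)
    for _ in range(num_iters):
        # First pass: push entity values to features (X^T v).
        feat = {}
        for e in entities:
            for f, w in cooc.get(e, {}).items():
                feat[f] = feat.get(f, 0.0) + w * v[e]
        # Second pass: pull feature values back to entities and row-normalize.
        new_v = {}
        for e in entities:
            if degree[e] > 0:
                new_v[e] = sum(w * feat[f] for f, w in cooc[e].items()) / degree[e]
            else:
                new_v[e] = v[e]  # isolated entity in this view: value unchanged
        norm = sum(abs(x) for x in new_v.values()) or 1.0
        v = {e: x / norm for e, x in new_v.items()}
    return v


def create_pic3_embedding(entities, x_t, x_s, m, t=5, seed=0):
    """Concatenate m PIC dimensions per view into a 2*m-dimensional embedding."""
    rng = random.Random(seed)
    embedding = {e: [] for e in entities}
    for cooc in (x_t, x_s):  # table-column view first, then suchas view
        for _ in range(m):
            column = bipartite_pic(cooc, entities, t, rng)
            for e in entities:
                embedding[e].append(column[e])
    return embedding
```

Sharing one random generator across all calls gives each dimension a different starting vector, which mirrors the requirement that every PIC run begins from a new random initialization.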


Our hypothesis is that these embeddings will cluster similar entities together. For example, Figure 3 shows a one-dimensional PIC embedding of entities belonging to the two classes “city” and “celebrity”. The value of the embedding is plotted against the entity index, and color indicates the class of an entity. We can clearly see that most entities belonging to the same class are clustered together. In the next section, we will discuss how the PIC3 embedding can be used for various semi-supervised and unsupervised tasks.


[Figure 2 labels: n * m PIC embedding, m << t]
Figure 2: Schematic diagram of matrices in the process of creating the PIC3 representation (n: number of entities, t: number of table-columns, s: number of suchas concepts, and m: number of PIC dimensions per bipartite graph).

Figure 3: One-dimensional PIC embedding for the ‘City’ and ‘Celebrity’ classes


3  Using  the  PIC3  Representation 

In this section we will see how this PIC3 representation for entities can be used for three different tasks.

3.1  Semi-supervised  Learning 

In semi-supervised transductive learning, a few entities of each class are labeled, and the learning method extrapolates these labels to a larger number of unlabeled data points. To use the PIC3 representation for this task, we simply learn a linear classifier in the embedded space. In the experiments below, we use labeled entities Yn from the NELL Knowledge Base (Carlson et al., 2010). We note that once the PIC3 representation has been built, this approach is much more efficient than applying graph-based iterative semi-supervised learning methods (Talukdar and Crammer, 2009; Carlson et al., 2010).
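
As a rough illustration of learning a linear classifier in the embedded space, the sketch below trains a plain perceptron on a handful of labeled low-dimensional vectors and applies it to unlabeled ones. The perceptron is a stand-in chosen to keep the example dependency-free; the experiments in this paper use an SVM. The sketch is binary (labels +1 and -1), and the function names and toy vectors are hypothetical rather than real PIC3 embeddings.

```python
def train_perceptron(points, labels, epochs=100):
    """Fit a linear separator w.x + b on labeled vectors (labels are +1 or -1)."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                # Misclassified (or on the boundary): move the separator towards x.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for an embedded vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings: +1 = "city", -1 = "celebrity".
    labeled = [[1.0, 1.0], [0.9, 1.2], [-1.0, -1.0], [-1.1, -0.8]]
    classes = [1, 1, -1, -1]
    w, b = train_perceptron(labeled, classes)
    for unlabeled in ([0.8, 0.9], [-0.9, -1.0]):
        print(unlabeled, predict(w, b, unlabeled))
```

Because the classifier only touches the few embedding dimensions of each entity, training and prediction cost is independent of the number of table-columns and suchas concepts in the original data.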

3.2  Set  Expansion 

Set expansion refers to the problem of expanding a set of seed entities into a larger set of entities of the same type. To perform set expansion with the PIC3 representation, we find the K nearest neighbors of the centroid of the set of seed entities using a KD-tree (refer to Algorithm 2). Again, this approach is more efficient at query time than prior approaches such as SEAL (Wang and Cohen, 2009b), which ranks nodes within a graph it builds on-the-fly at set expansion time using queries to the web.


Algorithm 2 Set Expansion with K-NN on PIC3
1: Input: Q: seed entities for set expansion,
          X_PIC3: low-dimensional PIC3 embedding of E
2: Output: Q': Expanded entity set
3: k = a large positive number
4: Q_c = centroid of entities in Q
5: Q' = Find-K-NearestNbr(Q_c, X_PIC3, k)
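
Algorithm 2 can be sketched in a few lines of Python. For brevity this version swaps the KD-tree for a brute-force scan over all entities, which returns the same neighbors but is slower on large entity sets. The function name `expand_set` and the dictionary format `entity -> vector` are assumptions made for the example.

```python
import heapq
import math


def expand_set(seeds, embedding, k):
    """Return the k entities whose embeddings are closest to the seed centroid.

    seeds: list of entity names that appear in `embedding`.
    embedding: dict entity -> list of floats (all of the same length).
    """
    dim = len(next(iter(embedding.values())))
    centroid = [sum(embedding[s][i] for s in seeds) / len(seeds)
                for i in range(dim)]
    # Brute-force scan; a KD-tree would replace this step for large entity sets.
    return heapq.nsmallest(
        k, embedding, key=lambda e: math.dist(embedding[e], centroid))


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings.
    toy = {
        "paris": [0.0, 0.0], "london": [0.1, 0.0], "rome": [0.05, 0.1],
        "madonna": [5.0, 5.0], "prince": [5.1, 4.9],
    }
    print(expand_set(["paris", "london"], toy, k=3))
```

As in Algorithm 2, the seeds themselves are candidates for the expanded set, since they are among the points of the embedding that is searched.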


3.3  Automatic  Set  Instance  Acquisition  (ASIA) 

This task takes as input the name of a semantic class (e.g., “countries”) and automatically outputs its instances (e.g., “USA”, “India”, “China” etc.). To perform this task, we look up instances of the given class in the hyponym dataset, and then perform set expansion on these, a process analogous to that used in prior work (Wang and Cohen, 2009a). Here, however, we again use Algorithm 2 for set expansion, so the entity-suchasConcept data are used only to find seeds for a particular class Y. Again, this method requires minimal resources at query time.

Dataset                          Toy_Apple   Delicious_Sports
|X|: # entities                  14,996      438
|C|: # table-columns             156         925
|(x, c)|: # (x, c) edges         176,598     9,192
|Ys|: # suchasConcepts           2,348       1,649
|(x, Ys)|: # (x, Ys) edges       7,683       4,799
|Yn|: # NELL classes             11          3
|(x, Yn)|: # (x, Yn) pairs       419         39
|Yc|: # manual column labels     31          30
|(c, Yc)|: # (c, Yc) pairs       156         925

Table 1: Table datasets statistics

4  Experiments 

Although the PIC algorithm is very scalable, in this paper we evaluate performance using smaller datasets which are extensively labeled. In particular, we use the Delicious_Sports, Toy_Apple and Hyponym Concept datasets made publicly available by Dalvi et al. (2012) to evaluate our techniques. Table 1 shows some statistics about these datasets. Numbers for |Ys| and |(x, Ys)| are derived using the Hyponym Concept Dataset.

4.1  Semi-supervised  Learning  (SSL) 

To evaluate the PIC embeddings in terms of predicting NELL concept-names, we compare the performance of an SVM classifier on PIC embeddings (named SVM+PIC3) vs. the original high-dimensional dataset (named SVM-Baseline). In the SVM-Baseline method, the hyponyms and table-columns associated with an entity are simply used as features. The number of iterations t for PIC and the number of dimensions per view m were set to t = m = 5 in these experiments. (Experiments with m > 5 showed no significant improvements in accuracy on the Toy_Apple dataset.)

Figure 4 shows the plot of accuracy vs. training size for both datasets. We can see that the SVM+PIC3 method is better than SVM-Baseline with less training data, and hence is better in SSL scenarios. Also note that the PIC3 embedding reduces the number of dimensions from 2574 (Delicious_Sports) and 2504 (Toy_Apple) to merely 10 dimensions. We checked the rank of the matrix which we use as the PIC3 representation to make sure that all the PIC embeddings are distinct. In our experiments we found that an m-dimensional embedding always has rank m. This is achieved by generating a new random vector V_0 using a distinct randomization seed each time we call the PIC embedding procedure (see Lines 7 and 12 in Algorithm 1).

(a) Delicious_Sports (entity-col-suchas) dataset
(b) Toy_Apple (entity-col-suchas) dataset

Figure 4: SSL Task: Comparison of SVM+PIC3 vs. SVM-Baseline


4.2  Set  Expansion 

We manually labeled every table column from the Delicious_Sports and Toy_Apple datasets. These labels are referred to as Yc in Table 1. This also gives us labels for the entities residing in these table-columns. We use the set of entities in each of these table columns as a set expansion query and evaluate the expanded set of entities based on the manual labels. The baseline runs K-Nearest Neighbor on the original high-dimensional dataset (referred to as K-NN-Baseline).


As another baseline, we adapt the MAD algorithm (Talukdar and Crammer, 2009), a state-of-the-art semi-supervised learning method. Similar to prior work (Talukdar et al., 2010), we adapt MAD for unsupervised learning by associating each table-column node with its own id as a label, and propagating these labels to other table-columns. MAD also includes a “dummy label”, so after propagation every table-column $T_q$ will be labeled with a weighted set of table-column ids $T_{s_1}, \ldots, T_{s_n}$ (including its own id), and will also have a weight for the “dummy label”. We denote MAD’s weight for associating table-column id $T_s$ with table column $T_q$ as $P(T_s \mid T_q)$, and consider the ids $T_{s_1}, \ldots, T_{s_k}$ with a weight higher than the dummy label’s weight. We consider $e_1, e_2, \ldots, e_n$, the union of entities present in columns $T_{s_1}, \ldots, T_{s_k}$, and rank them in descending order of score, where $\mathrm{score}(e_i, T_q) = \sum_{j : e_i \in T_{s_j}} P(T_{s_j} \mid T_q)$.

Figure 5 shows Precision-Recall curves for all 3 methods on sample set expansion queries. These plots are annotated with manual column-labels (Yc). For most of the queries, K-NN+PIC3 performs as well as K-NN-Baseline and is comparable to the MAD algorithm. Table 2 shows the running time for all three methods. The K-NN+PIC3 method incurs a small amount of preprocessing time (0.02 seconds) to create embeddings, and compared to the other two methods it is very fast at query time. The numbers show total query time for 881 Set Expansion queries and 25 ASIA queries (described below).

Figure 5: Set Expansion Task: Precision recall curves
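
The ranking score used for the MAD baseline in this section can be written out as a short function. The sketch assumes the label weights for a query column have already been produced by a label-propagation run, so it covers only the thresholding against the dummy label and the summation over columns. The function name `rank_entities` and the input format are hypothetical.

```python
def rank_entities(label_weights, dummy_weight, columns):
    """Rank entities for one query column T_q.

    label_weights: dict column id T_s -> weight P(T_s | T_q).
    dummy_weight: weight assigned to the dummy label for T_q.
    columns: dict column id -> set of entities appearing in that column.
    Returns (entity, score) pairs sorted by descending score.
    """
    scores = {}
    for col_id, weight in label_weights.items():
        if weight <= dummy_weight:
            continue  # keep only ids that outweigh the dummy label
        for entity in columns[col_id]:
            # score(e, T_q) sums the weights of all retained columns containing e.
            scores[entity] = scores.get(entity, 0.0) + weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical propagated weights for a single query column.
    weights = {"t1": 0.5, "t2": 0.3, "t3": 0.05}
    cols = {"t1": {"a", "b"}, "t2": {"b", "c"}, "t3": {"d"}}
    for entity, score in rank_entities(weights, dummy_weight=0.1, columns=cols):
        print(entity, score)
```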


                   Total Query Time (s)
Method             Set Expansion   ASIA
K-NN+PIC3          12.7            0.5
K-NN-Baseline      80.1            1.4
MAD                38.2            150.0

Table 2: Comparison of Run Time on Delicious_Sports


4.3  Automatic  Set  Instance  Acquisition 

For the automatic set instance acquisition (ASIA) task, we use concept-names from the Hyponym Concept Dataset (Ys) as queries. Similar to the Set Expansion task, we compare K-NN+PIC3 to the K-NN-Baseline and MAD methods.

To use MAD for this task, the concept name Ys is injected as a label for the ten entities that co-occur most with Ys, and the label propagation algorithm is run. Each entity e_i that scores higher than the dummy label is then ranked based on the probability of the label Ys for that entity.

Figure 6 shows the comparison of all three methods. K-NN+PIC3 generally outperforms K-NN-Baseline, and outperforms MAD on two of the four queries. MAD’s improvements over K-NN+PIC3 on the other two queries come at the expense of longer query times (refer to Table 2).


5  Conclusion 

We presented a novel low-dimensional representation for entities on the Web using Power Iteration Clustering. Our experiments show encouraging results on using this representation for three different tasks: (a) Semi-Supervised Learning, (b) Set Expansion and (c) Automatic Set Instance Acquisition. The experiments show that this simple representation (PIC3) can go a long way, and can solve different problems in a simpler and faster, if not better, way. In future, we would like to use this representation for named-entity disambiguation and unsupervised class-instance pair acquisition, and to explore performance on larger datasets.

Figure 6: ASIA task: Precision recall curves